Article 5313

Title of the article

ALGORITHMS OF INVERTED INDEX CONSTRUCTION FOR TEXT DATA COLLECTION 

Authors

Trifonov Aleksandr Aleksandrovich, Postgraduate student, Penza State University (40 Krasnaya street, Penza, Russia), alexander.a.trifonov@gmail.com 

Index UDK

004.046 

Abstract

Background. With the rapid growth of the information in the in the XXI century it has become increasingly difficult to find and select useful and qualitative data. This is the so called "adverse selection" effect and a lot of "noise." These phenomena are especially visible in the global computer network. When creating information retrieval systems the sizes of the processed text collections are often very high and that demands improved methods and tools for building search engines. Methods and tools for data analysis of textual information are used for selection of the high quality textual information. So that to avoid a sequential survey of the texts when fulfilling each request, the inverted index of documents, which assigns to the terms of those documents from the collection, in which they occur is compiled in advance. The purpose of this paper is a detailed analysis of the existing algorithms for an inverted index construction for the text collection, their advantages and disadvantages. In addition, it is necessary to compare the time complexity of the analyzed
algorithms, which allows to draw conclusions about the validity of each of them in a particular case.
Materials and methods. In information retrieval systems, the solutions depend on the characteristics of computer software, which system will be deployed. Index defining methods can be divided into two categories: the first group is based on the memory, and the second is based on the disc. Research data show that the performance of algorithms for the index construction is highly dependent on the amount of memory available to the process of indexing. Taking into account the specificity of the indexing algorithm, it makes sense to compare their complexity when the volume of the collection is greater than or less than the amount M of RAM PC running the indexing.
Results. The time complexity of the analyzed algorithms for an inverted index construction for a text collection based on various parameters has been investigated. An inverted index is usually too large to be completely loaded into the memory. If the amount of memory available to the process of indexing is too small to allow the index to be created entirely in memory, the described method of constructing the index in memory can be expanded up to the merge-based method in which a set of dynamic text is divided into subsets based on
the amount of RAM available. Comparison of the time complexity of the algorithms depending on the amount of memory in the PC that is running the indexing, allows to draw conclusions about the validity of each of them in a particular case.
Conclusions. Sort-based index construction should be applied when the amount of text collection is not more than a few gigabytes. When the volume of the collection is increasing the construction of the index, based on the merger seems more efficient. 

Key words

information retrieval, text collection, inverted index, index construction.

Download PDF
References

1. Manning K., Ragkhavan P., Shyuttse Kh. Vvedenie v informatsionnyy poisk: per. s angl. [Introduction into data search: translation from English]. Moscow: Vil'yams, 2011, 528 p.
2. Büttcher, S. Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010, 606 p.
3. Lande D. V., Snarskiy A. A., Bezsudnov I. V. Internetika. Navigatsiya v slozhnykh setyakh. Modeli i algoritmy [Internetics. Navigation in complex networks. Models and algorithms]. Moscow: Librokom, 2009, 264 p.
4. Leont'eva N. N. Avtomaticheskoe ponimanie tekstov: sistemy, modeli, resursy: ucheb. posobie dlya stud. lingv. fak. vuzov [Automatic text recognition: sistems, models, resources: tutorial of linguistic faculty students]. Moscow: Akademiya, 2006, 304 p.

 

Дата создания: 28.08.2014 13:58
Дата обновления: 28.08.2014 15:08